
Introduction¶

Welcome to Boston, Massachusetts in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before it starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics like:

  • The number of rooms
  • The distance to employment centres
  • How rich or poor the area is
  • How many students there are per teacher in local schools etc

To accomplish your task you will:

  1. Analyse and explore the Boston house price data
  2. Split your data for training and testing
  3. Run a Multivariable Regression
  4. Evaluate your model's coefficients and residuals
  5. Use data transformation to improve your model performance
  6. Use your model to estimate a property price

Upgrade plotly (only Google Colab Notebook)¶

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.

In [1]:
# %pip install --upgrade plotly

Import Statements¶

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Notebook Presentation¶

In [3]:
pd.options.display.float_format = '{:,.2f}'.format

Load the Data¶

The first column in the .csv file just has the row numbers, so it will be used as the index.

In [4]:
data = pd.read_csv('boston.csv', index_col=0)

Understand the Boston House Price Dataset¶


Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.

:Attribute Information (in order):
    1. CRIM     per capita crime rate by town
    2. ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS    proportion of non-retail business acres per town
    4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5. NOX      nitric oxides concentration (parts per 10 million)
    6. RM       average number of rooms per dwelling
    7. AGE      proportion of owner-occupied units built prior to 1940
    8. DIS      weighted distances to five Boston employment centres
    9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT    % lower status of the population
    14. PRICE     Median value of owner-occupied homes in $1000's
    
:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. You can find the original research paper here.

Preliminary Data Exploration 🔎¶

Challenge

  • What is the shape of data?
  • How many rows and columns does it have?
  • What are the column names?
  • Are there any NaN values or duplicates?
In [5]:
data.shape
Out[5]:
(506, 14)
In [6]:
data.sample(3)
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
113 0.22 0.00 10.01 0.00 0.55 6.09 95.40 2.55 6.00 432.00 17.80 396.90 17.09 18.70
265 0.76 20.00 3.97 0.00 0.65 5.56 62.80 1.99 5.00 264.00 13.00 392.40 10.45 22.80
294 0.08 0.00 13.92 0.00 0.44 6.01 42.30 5.50 4.00 289.00 16.00 396.90 10.40 21.70
In [7]:
data.duplicated().values.any()
Out[7]:
False
In [8]:
data.describe()
Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00
mean 3.61 11.36 11.14 0.07 0.55 6.28 68.57 3.80 9.55 408.24 18.46 356.67 12.65 22.53
std 8.60 23.32 6.86 0.25 0.12 0.70 28.15 2.11 8.71 168.54 2.16 91.29 7.14 9.20
min 0.01 0.00 0.46 0.00 0.39 3.56 2.90 1.13 1.00 187.00 12.60 0.32 1.73 5.00
25% 0.08 0.00 5.19 0.00 0.45 5.89 45.02 2.10 4.00 279.00 17.40 375.38 6.95 17.02
50% 0.26 0.00 9.69 0.00 0.54 6.21 77.50 3.21 5.00 330.00 19.05 391.44 11.36 21.20
75% 3.68 12.50 18.10 0.00 0.62 6.62 94.07 5.19 24.00 666.00 20.20 396.23 16.96 25.00
max 88.98 100.00 27.74 1.00 0.87 8.78 100.00 12.13 24.00 711.00 22.00 396.90 37.97 50.00

Data Cleaning - Check for Missing Values and Duplicates¶

In [9]:
data.isna().values.any()
Out[9]:
False
In [10]:
data.isnull().values.any()
Out[10]:
False
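
The checks above can be tried on a tiny frame where the answers are known in advance. This is a minimal sketch on hypothetical toy data (the `toy` DataFrame is not part of the Boston dataset). Note that `isna()` and `isnull()` are aliases in pandas, so one of the two checks is enough.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Boston dataset
toy = pd.DataFrame({
    "RM": [6.0, 5.5, np.nan, 6.0],
    "PRICE": [24.0, 18.5, 21.0, 24.0],
})

has_missing = toy.isna().values.any()            # any NaN anywhere in the frame?
missing_per_column = toy.isna().sum()            # NaN count for each column
has_duplicates = toy.duplicated().values.any()   # any fully repeated row?

print(has_missing, has_duplicates)
print(missing_per_column)
```

Counting missing values per column with `.isna().sum()` is often more useful than a single True/False, because it shows where the gaps are.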

Descriptive Statistics¶

Challenge

  • How many students are there per teacher on average?
  • What is the average price of a home in the dataset?
  • What is the CHAS feature?
  • What are the minimum and the maximum value of the CHAS and why?
  • What is the maximum and the minimum number of rooms per dwelling in the dataset?
In [11]:
data.PTRATIO.describe() # pupil to teacher ratio
Out[11]:
count   506.00
mean     18.46
std       2.16
min      12.60
25%      17.40
50%      19.05
75%      20.20
max      22.00
Name: PTRATIO, dtype: float64
In [12]:
# The mean PTRATIO is 18.46, i.e. roughly 18 students per teacher on average
In [13]:
data['PRICE'].describe()
Out[13]:
count   506.00
mean     22.53
std       9.20
min       5.00
25%      17.02
50%      21.20
75%      25.00
max      50.00
Name: PRICE, dtype: float64
In [14]:
data['PRICE'].mean() * 1000 # average home price in $
Out[14]:
22532.806324110676
In [15]:
# CHAS is a 0/1 dummy variable (like a Boolean): 1 if the tract borders the Charles River, 0 otherwise, so its min is 0 and its max is 1
In [16]:
data['RM'].describe()
Out[16]:
count   506.00
mean      6.28
std       0.70
min       3.56
25%       5.89
50%       6.21
75%       6.62
max       8.78
Name: RM, dtype: float64
In [17]:
# RM is the AVERAGE number of rooms per dwelling: min 3.56 (about 4), max 8.78 (about 9)

Visualise the Features¶

Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the following variables:

  • PRICE: The home price in thousands.
  • RM: the average number of rooms per owner unit.
  • DIS: the weighted distance to the 5 Boston employment centres i.e., the estimated length of the commute.
  • RAD: the index of accessibility to highways.

Try setting the aspect parameter to 2 for a better picture.

What do you notice in the distributions of the data?

In [18]:
# One histogram with a KDE overlay per feature
sns.displot(data=data, x="PRICE", aspect=2, kde=True)
sns.displot(data=data, x="RM", aspect=2, kde=True)
sns.displot(data=data, x="DIS", aspect=2, kde=True)
sns.displot(data=data, x="RAD", aspect=2, kde=True)
Out[18]:
<seaborn.axisgrid.FacetGrid at 0x7b331e9eb2e0>
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [19]:
# The index of accessibility to highways was quite surprising: it looks like
# there is some kind of separate district, perhaps an industrial one

# 6-7 rooms on average? That is large, so these are probably standalone houses

# Prices are not surprising, but I assume there are some low-cost and
# high-cost areas

# Separate plots are below

House Prices 💰¶

In [20]:
sns.displot(data=data, x="PRICE",aspect=2,kde=True,)
Out[20]:
<seaborn.axisgrid.FacetGrid at 0x7b331c7e4b80>
No description has been provided for this image

Distance to Employment - Length of Commute 🚗¶

In [21]:
sns.displot(data=data, x="DIS",aspect=2,kde=True,)
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x7b331c711e70>
No description has been provided for this image

Number of Rooms¶

In [22]:
sns.displot(data=data, x="RM",aspect=2,kde=True,)
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x7b331c8115d0>
No description has been provided for this image

Access to Highways 🛣¶

In [23]:
sns.displot(data=data, x="RAD",aspect=2,kde=True,)
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x7b331c5e3100>
No description has been provided for this image

Next to the River? ⛵️¶

Challenge

Create a bar chart with plotly for CHAS to show how many more homes are away from the river than next to it. The bar chart should look something like this:

No description has been provided for this image

You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])

In [24]:
data['CHAS'].value_counts()
Out[24]:
0.00    471
1.00     35
Name: CHAS, dtype: int64
In [25]:
bar = px.bar(x=['No', 'Yes'],
             y=data['CHAS'].value_counts(),
             color=data['CHAS'].value_counts(),
             )

bar.show()
No description has been provided for this image

Understand the Relationships in the Data¶

Run a Pair Plot¶

Challenge

There might be some relationships in the data that we should know about. Before you run the code, make some predictions:

  • What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
  • What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
  • What about the amount of poverty in an area (LSTAT) and home prices?

Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it's finished check your intuition regarding the questions above on the pairplot.

In [26]:
# greater distance from employment centres -> lower pollution
# more rooms -> higher price, though perhaps not in the suburbs
# higher LSTAT -> lower home prices in the area
In [27]:
sns.pairplot(data)
Out[27]:
<seaborn.axisgrid.PairGrid at 0x7b331b947610>
No description has been provided for this image

Challenge

Use Seaborn's .jointplot() to look at some of the relationships in more detail. Create a jointplot for:

  • DIS and NOX
  • INDUS vs NOX
  • LSTAT vs RM
  • LSTAT vs PRICE
  • RM vs PRICE

Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.

Distance from Employment vs. Pollution¶

Challenge:

Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?

In [28]:
sns.jointplot(data=data,
              x='DIS',
              y='NOX',
              hue='CHAS',
              )
Out[28]:
<seaborn.axisgrid.JointGrid at 0x7b331c50ead0>
No description has been provided for this image

Proportion of Non-Retail Industry 🏭🏭🏭 versus Pollution¶

Challenge:

Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?

In [29]:
sns.jointplot(data=data,
              x='INDUS',
              y='NOX',
              hue='CHAS',
              )
Out[29]:
<seaborn.axisgrid.JointGrid at 0x7b33149bfca0>
No description has been provided for this image

% of Lower Income Population vs Average Number of Rooms¶

Challenge

Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of area? Do homes have more or fewer rooms when LSTAT is low?

In [30]:
sns.jointplot(data=data,
              x='LSTAT',
              y='RM',
              hue='CHAS',
              )
Out[30]:
<seaborn.axisgrid.JointGrid at 0x7b330f118af0>
No description has been provided for this image

% of Lower Income Population versus Home Price¶

Challenge

Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?

In [31]:
sns.jointplot(data=data,
              x='LSTAT',
              y='PRICE',
              hue='CHAS',
              )
Out[31]:
<seaborn.axisgrid.JointGrid at 0x7b330f00cd00>
No description has been provided for this image

Number of Rooms versus Home Value¶

Challenge

Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊

In [32]:
sns.jointplot(data=data,
              x='RM',
              y='PRICE',
              hue='DIS',
              )
Out[32]:
<seaborn.axisgrid.JointGrid at 0x7b330ede57e0>
No description has been provided for this image
In [33]:
# The 6-room homes close to the employment centres are the cheapest?
In [34]:
# The wealthiest people seem to like living next to the river

Split Training & Test Dataset¶

We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.

Challenge

  • Import the train_test_split() function from sklearn
  • Create 4 subsets: X_train, X_test, y_train, y_test
  • Split the training and testing data roughly 80/20.
  • To get the same random split every time you run your notebook use random_state=10. This helps us get the same results every time and avoid confusion while we're learning.

Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.

In [35]:
X = data.iloc[:,:-1]
y=data.iloc[:,-1]
In [36]:
from sklearn.model_selection import train_test_split
In [37]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.2,
                                                    random_state=10)
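
To see what the split produces and why `random_state` matters, here is a minimal sketch on hypothetical random data with the same dimensions as our dataset (506 rows, 13 features). The names `X_demo` and `y_demo` are placeholders and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 506 rows and 13 features of random noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same random_state gives exactly the same rows in each subset
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Because 20% of 506 is not a whole number, the test set size is rounded up, which is why the split is only "roughly" 80/20.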

Multivariable Regression¶

In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:

$$ PR \hat ICE = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta _3 DIS + \theta _4 CHAS ... + \theta _{13} LSTAT$$

Run Your First Regression¶

Challenge

Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?

In [38]:
from sklearn.linear_model import LinearRegression

regression = LinearRegression()
regression.fit(X_train,y_train)
Out[38]:
LinearRegression()
In [39]:
regression.score(X_test,y_test) # r-squared on the TEST set; pass X_train, y_train to answer the training-data question
Out[39]:
0.6709339839115642
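
It can help to see what `.score()` measures. For a regressor it returns r-squared, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes that by hand on hypothetical synthetic data. It uses NumPy's least-squares solver rather than sklearn, purely to keep the arithmetic visible, and all names and numbers in it are made up for illustration.

```python
import numpy as np

# Hypothetical synthetic data: 3 features with known coefficients plus noise
rng = np.random.default_rng(1)
n = 200
X_demo = rng.normal(size=(n, 3))
true_coefs = np.array([3.0, -1.5, 0.5])
y_demo = 20.0 + X_demo @ true_coefs + rng.normal(scale=2.0, size=n)

# Add a column of ones for the intercept and solve the least-squares problem
A = np.column_stack([np.ones(n), X_demo])
theta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
y_hat = A @ theta

# r-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"intercept and coefficients: {np.round(theta, 2)}")
print(f"r-squared: {r_squared:.3f}")
```

An r-squared of 1 would mean the predictions match the targets exactly, while a value near 0 means the model does no better than always predicting the mean.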

Evaluate the Coefficients of the Model¶

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative).

Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.

  • We already saw that RM on its own had a positive relation to PRICE based on the scatter plot. Is RM's coefficient also positive?
  • What is the sign on the LSTAT coefficient? Does it match your intuition and the scatter plot above?
  • Check the other coefficients. Do they have the expected sign?
  • Based on the coefficients, how much more expensive is a home with 6 rooms compared to a home with 5 rooms? According to the model, what is the premium you would have to pay for an extra room?
In [40]:
regression.intercept_
Out[40]:
36.53305138282431
In [41]:
regression.coef_
Out[41]:
array([-1.28180656e-01,  6.31981786e-02, -7.57627602e-03,  1.97451452e+00,
       -1.62719890e+01,  3.10845625e+00,  1.62922153e-02, -1.48301360e+00,
        3.03988206e-01, -1.20820710e-02, -8.20305699e-01,  1.14189890e-02,
       -5.81626431e-01])
In [42]:
regression.coef_.shape
Out[42]:
(13,)
In [43]:
data.columns[:-1]
Out[43]:
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')
In [44]:
dict(zip(data.columns[:-1], regression.coef_))
Out[44]:
{'CRIM': -0.12818065642264795,
 'ZN': 0.06319817864608888,
 'INDUS': -0.00757627601533797,
 'CHAS': 1.9745145165622597,
 'NOX': -16.271988951469734,
 'RM': 3.1084562454033,
 'AGE': 0.01629221534560711,
 'DIS': -1.4830135966050273,
 'RAD': 0.30398820612116106,
 'TAX': -0.012082071043592574,
 'PTRATIO': -0.8203056992885642,
 'B': 0.011418989022213357,
 'LSTAT': -0.581626431182139}
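
The hint in the challenge suggests a DataFrame for readability. A minimal sketch of that idea follows, with the coefficients copied from the output above and rounded to four decimals. The `sign` column and the variable names are illustrative additions.

```python
import pandas as pd

# Coefficients copied (rounded) from the regression output above
coefs = {
    "CRIM": -0.1282, "ZN": 0.0632, "INDUS": -0.0076, "CHAS": 1.9745,
    "NOX": -16.2720, "RM": 3.1085, "AGE": 0.0163, "DIS": -1.4830,
    "RAD": 0.3040, "TAX": -0.0121, "PTRATIO": -0.8203, "B": 0.0114,
    "LSTAT": -0.5816,
}

coef_table = pd.Series(coefs, name="coefficient").to_frame()
# Make the direction of each effect easy to scan
coef_table["sign"] = coef_table["coefficient"].apply(lambda c: "+" if c > 0 else "-")

# PRICE is measured in $1000s, so multiply by 1000 for dollars
premium_per_room = coef_table.loc["RM", "coefficient"] * 1000

print(coef_table)
print(f"Premium for one extra room: about ${premium_per_room:,.0f}")
```

In the real notebook the same table could be built directly from `regression.coef_` and the feature column names instead of typed-in values.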
In [45]:
# The RM coefficient is about 3.11 and PRICE is in $1000s,
# so the model charges roughly $3,100 for each extra room.
# The cells below check this on a concrete example.
In [46]:
## Let's pick a random home and vary its number of rooms
In [47]:
sample_x = data.sample(1)
print(sample_x)
     CRIM   ZN  INDUS  CHAS  NOX   RM    AGE  DIS   RAD    TAX  PTRATIO  \
379 17.87 0.00  18.10  0.00 0.67 6.22 100.00 1.39 24.00 666.00    20.20   

         B  LSTAT  PRICE  
379 393.74  21.78  10.20  
In [48]:
sample_x
Out[48]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
379 17.87 0.00 18.10 0.00 0.67 6.22 100.00 1.39 24.00 666.00 20.20 393.74 21.78 10.20
In [49]:
sample_x = sample_x.drop(['PRICE'],axis=1)
In [50]:
sample_x
Out[50]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
379 17.87 0.00 18.10 0.00 0.67 6.22 100.00 1.39 24.00 666.00 20.20 393.74 21.78
In [51]:
regression.predict(sample_x)
Out[51]:
array([16.61196204])
In [52]:
sample_x['RM'] = 5.0
In [53]:
sample_x
Out[53]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
379 17.87 0.00 18.10 0.00 0.67 5.00 100.00 1.39 24.00 666.00 20.20 393.74 21.78
In [54]:
regression.predict(sample_x)
Out[54]:
array([12.81032005])
In [55]:
# Going from 6.22 rooms down to 5 lowers the prediction by about 3.8, i.e. roughly $3,800
In [56]:
sample_x['RM'] = 6.0
In [57]:
regression.predict(sample_x)
Out[57]:
array([15.9187763])
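
The two predictions above allow a quick sanity check. In a linear model, changing one feature by exactly 1 while holding the others fixed should change the prediction by exactly that feature's coefficient. The sketch below redoes the arithmetic with the numbers copied from the outputs above.

```python
# Predictions copied from the outputs above (in $1000s)
pred_5_rooms = 12.81032005
pred_6_rooms = 15.9187763

# RM entry of regression.coef_, copied from the coefficient output
rm_coefficient = 3.10845625

premium = pred_6_rooms - pred_5_rooms
matches = abs(premium - rm_coefficient) < 1e-6

print(f"Premium for one extra room: ${premium * 1000:,.2f}")
print(f"Equal to the RM coefficient: {matches}")
```

This is the "example instead of pure math" route to the same answer: the model prices an extra room at the RM coefficient, whatever the other characteristics of the home are.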

Analyse the Estimated Values & Regression Residuals¶

The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.

predicted_values = regression.predict(X_train)
residuals = (y_train - predicted_values)

Challenge: Create two scatter plots.

The first plot should be actual values (y_train) against the predicted values:

No description has been provided for this image

The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊

The second plot should be the residuals against the predicted prices. Here's what we're looking for:

No description has been provided for this image
In [58]:
y_train_predicted = regression.predict(X_train)
In [59]:
residuals = (y_train - y_train_predicted )
In [60]:
plt.figure(figsize=(8,4), dpi=200)

ax = sns.scatterplot(x=y_train,
                     y=y_train_predicted)

# Reference line y = x: perfect predictions would fall exactly on it
ax.plot(y_train, y_train, color='cyan')

ax.set(xlabel='Actual price', ylabel='Predicted price')

plt.show()
No description has been provided for this image
In [61]:
plt.figure(figsize=(8,4), dpi=200)

ax = sns.scatterplot(y=residuals,
                    x=y_train_predicted,
                     hue=residuals,)


ax.set(xlabel='Predicted price',ylabel='Divergence')

plt.show()
No description has been provided for this image

Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.

We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.

In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical - the bell curve is not lopsided or biased to one side. Here's what a normal distribution looks like:

No description has been provided for this image
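
To get a feel for what mean and skewness say about a distribution, here is a minimal sketch comparing a symmetric sample with a lopsided one. Both samples are hypothetical random draws, not model residuals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Symmetric bell curve centred on 0
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5000))
# Exponential draws have a long right tail, so positive skew
right_skewed = pd.Series(rng.exponential(scale=1.0, size=5000))

print(f"normal:      mean={symmetric.mean():.3f}, skew={symmetric.skew():.3f}")
print(f"exponential: mean={right_skewed.mean():.3f}, skew={right_skewed.skew():.3f}")
```

Residuals that behave like the first sample are what we hope for. Residuals that look like the second suggest the model is systematically off in one direction for part of the data.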

Challenge

  • Calculate the mean and the skewness of the residuals.
  • Again, use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE)
  • Is the skewness different from zero? If so, by how much?
  • Is the mean different from zero?
In [62]:
df_residuals = pd.DataFrame( [{'skewness':residuals.skew() , 'mean':residuals.mean(),  }] )
In [63]:
sns.displot(data=df_residuals,x='skewness')
Out[63]:
<seaborn.axisgrid.FacetGrid at 0x7b330ea3c460>
No description has been provided for this image
In [64]:
## A one-row DataFrame of summary statistics has no distribution to plot; plot the residuals themselves instead
In [65]:
sns.displot(x=residuals, kde=True, bins=20, stat="density")
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Density")
plt.show()
No description has been provided for this image

Data Transformations for a Better Fit¶

We have two options at this point:

  1. Change our model entirely. Perhaps a linear model is not appropriate.
  2. Transform our data to make it fit better with our linear model.

Let's try a data transformation approach.

Challenge

Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.

  • Use Seaborn's .displot() to show a histogram and KDE of the price data.
  • Calculate the skew of that distribution.
  • Use NumPy's log() function to create a Series that has the log prices
  • Plot the log prices using Seaborn's .displot() and calculate the skew.
  • Which distribution has a skew that's closer to zero?
In [ ]:
# The log prices have a skew closer to zero (-0.33) than the original prices (1.108)
In [76]:
sns.displot(x=data['PRICE'], kde=True, bins=20, stat="density")
plt.title("Distribution of Price")
plt.xlabel("Price")
plt.ylabel("Density")
plt.show()
No description has been provided for this image
In [75]:
print(f"Skewness: {round(data['PRICE'].skew(),3)}")
Skewness: 1.108
In [71]:
price_log = np.log(data['PRICE'])
In [77]:
sns.displot(x=price_log, kde=True, bins=20, stat="density")
plt.title("Log Distribution of Price")
plt.xlabel("Log Price")
plt.ylabel("Density")
plt.show()
No description has been provided for this image
In [78]:
print(f"Skewness: {round(price_log.skew(),3)}")
Skewness: -0.33

How does the log transformation work?¶

Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:

No description has been provided for this image

We can see this when we plot the actual prices against the (transformed) log prices.

In [66]:
plt.figure(dpi=150)
plt.scatter(data.PRICE, np.log(data.PRICE))

plt.title('Mapping the Original Price to a Log Price')
plt.ylabel('Log Price')
plt.xlabel('Actual $ Price in 000s')
plt.show()
No description has been provided for this image
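
The compression can also be seen with a few numbers. In the sketch below, the prices are hypothetical round values within the dataset's range of 5 to 50. Doubling a price always adds the same amount on the log scale, so a $25k gap at the top of the range shrinks to the same size as a $5k gap at the bottom.

```python
import numpy as np

# Hypothetical prices in $1000s, within the dataset's 5 to 50 range
prices = np.array([5.0, 10.0, 25.0, 50.0])
log_prices = np.log(prices)

low_gap = log_prices[1] - log_prices[0]    # 5 -> 10, a $5k difference
high_gap = log_prices[3] - log_prices[2]   # 25 -> 50, a $25k difference

for price, log_price in zip(prices, log_prices):
    print(f"price {price:5.1f} -> log price {log_price:.3f}")

print(f"gap 5 -> 10:  {low_gap:.3f}")
print(f"gap 25 -> 50: {high_gap:.3f}")
```

Both gaps equal log(2), which is why the long right tail of expensive homes is pulled in much more than the cheap end of the distribution.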

Regression using Log Prices¶

Using log prices instead, our model has changed to:

$$ \log (PR \hat ICE) = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta_3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$

Challenge:

  • Use train_test_split() with the same random state as before to make the results comparable.
  • Run a second regression, but this time use the transformed target data.
  • What is the r-squared of the regression on the training data?
  • Have we improved the fit of our model compared to before based on this measure?
In [79]:
y = np.log(data['PRICE'])
In [80]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.2,
                                                    random_state=10)
In [81]:
regression = LinearRegression()
regression.fit(X_train,y_train)
Out[81]:
LinearRegression()
In [84]:
regression.score(X_test,y_test) # test-set r-squared: about 0.745 vs 0.671 before the log transform
Out[84]:
0.7446922306260739

Evaluating Coefficients with Log Prices¶

Challenge: Print out the coefficients of the new regression model.

  • Do the coefficients still have the expected sign?
  • Is being next to the river a positive based on the data?
  • How does the quality of the schools affect property prices? What happens to prices as there are more students per teacher?

Hint: Use a DataFrame to make the output look pretty.

In [86]:
regression.coef_
Out[86]:
array([-1.06717261e-02,  1.57929102e-03,  2.02989827e-03,  8.03305301e-02,
       -7.04068057e-01,  7.34044072e-02,  7.63301755e-04, -4.76332789e-02,
        1.45651350e-02, -6.44998303e-04, -3.47947628e-02,  5.15896157e-04,
       -3.13900565e-02])
In [87]:
dict(zip(data.columns[:-1], regression.coef_))
Out[87]:
{'CRIM': -0.010671726123550168,
 'ZN': 0.0015792910237792527,
 'INDUS': 0.0020298982724026335,
 'CHAS': 0.0803305301283453,
 'NOX': -0.7040680570150262,
 'RM': 0.0734044072331749,
 'AGE': 0.0007633017550354285,
 'DIS': -0.04763327892124784,
 'RAD': 0.014565134991367565,
 'TAX': -0.0006449983030440104,
 'PTRATIO': -0.03479476276651838,
 'B': 0.0005158961569951322,
 'LSTAT': -0.03139005646263394}
In [88]:
# CHAS (next to the river) is positive: 0.08 in log terms, roughly an 8% price premium
In [ ]:
# More students per teacher -> lower price (the PTRATIO coefficient is negative)
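
With a log target the coefficients are no longer in thousands of dollars. A one-unit increase in a feature multiplies the predicted price by exp(coefficient), so the coefficients read as approximate percentage changes. The sketch below converts three of the coefficients from the output above. The dictionary and variable names are illustrative.

```python
import math

# Coefficients copied from the log-price regression output above
log_coefs = {
    "CHAS": 0.0803305301283453,
    "PTRATIO": -0.03479476276651838,
    "RM": 0.0734044072331749,
}

# exp(coef) is the multiplier on price for a one-unit increase in the feature
pct_change = {name: (math.exp(coef) - 1) * 100 for name, coef in log_coefs.items()}

for name, pct in pct_change.items():
    print(f"{name}: {pct:+.2f}% per unit increase")
```

For small coefficients the percentage is close to the coefficient times 100, which is why a CHAS value of 0.08 can be read as a premium of roughly 8%.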

Regression with Log Prices & Residual Plots¶

Challenge:

  • Copy-paste the cell where you've created scatter plots of the actual versus the predicted home prices as well as the residuals versus the predicted values.
  • Add 2 more plots to the cell so that you can compare the regression outcomes with the log prices side by side.
  • Use indigo as the colour for the original regression and navy for the regression using log prices.
In [90]:
y_train_predicted = regression.predict(X_train)
In [91]:
plt.figure(figsize=(8,4), dpi=200)

ax = sns.scatterplot(x=y_train,
                     y=y_train_predicted)

# Reference line y = x: perfect predictions would fall exactly on it
ax.plot(y_train, y_train, color='cyan')

ax.set(xlabel='Actual log price', ylabel='Predicted log price')

plt.show()
No description has been provided for this image
In [92]:
residuals = (y_train - y_train_predicted )
In [93]:
plt.figure(figsize=(8,4), dpi=200)

ax = sns.scatterplot(y=residuals,
                    x=y_train_predicted,
                     hue=residuals,)


ax.set(xlabel='Predicted price',ylabel='Divergence')

plt.show()
No description has been provided for this image
In [94]:
# i don't understand rest of this challenge
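
One way to read the rest of the challenge is as a 2x2 grid. The top row shows actual against predicted values, the bottom row shows residuals against predicted values, with the original regression in indigo on the left and the log-price regression in navy on the right. The sketch below is one possible layout, not the course's reference solution. The helper `plot_comparison` and the stand-in data are hypothetical. In the notebook you would pass in the targets and predictions from both regressions, which means keeping the first model's values under separate names before refitting.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_comparison(y, y_hat, log_y, log_y_hat):
    """Draw actual-vs-predicted and residual plots for both regressions."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Top row: actual vs predicted, with a cyan y = x reference line
    top = [
        (axes[0, 0], y, y_hat, "indigo", "Actual vs predicted prices"),
        (axes[0, 1], log_y, log_y_hat, "navy", "Actual vs predicted log prices"),
    ]
    for ax, actual, predicted, colour, title in top:
        ax.scatter(actual, predicted, color=colour, alpha=0.6)
        ax.plot(actual, actual, color="cyan")
        ax.set(title=title, xlabel="Actual", ylabel="Predicted")

    # Bottom row: residuals vs predicted values
    bottom = [
        (axes[1, 0], y_hat, y - y_hat, "indigo", "Residuals vs predicted prices"),
        (axes[1, 1], log_y_hat, log_y - log_y_hat, "navy", "Residuals vs predicted log prices"),
    ]
    for ax, predicted, resid, colour, title in bottom:
        ax.scatter(predicted, resid, color=colour, alpha=0.6)
        ax.set(title=title, xlabel="Predicted", ylabel="Residual")

    fig.tight_layout()
    return fig

# Hypothetical stand-in values, NOT the notebook's actual predictions
rng = np.random.default_rng(3)
y = rng.uniform(5, 50, size=100)
y_hat = y + rng.normal(scale=4, size=100)
log_y = np.log(y)
log_y_hat = log_y + rng.normal(scale=0.15, size=100)

fig = plot_comparison(y, y_hat, log_y, log_y_hat)
```

Putting the four panels in one figure makes it easier to judge whether the residuals of the log model look more random than those of the original model.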

Challenge:

Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?

In [95]:
sns.displot(x=residuals, kde=True, bins=20, stat="density")
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Density")
plt.show()
No description has been provided for this image

Compare Out of Sample Performance¶

The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.

Challenge

Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?

In [96]:
regression.score(X_test,y_test) # test-set r-squared of the log-price model, the same value as computed earlier
Out[96]:
0.7446922306260739
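
To compare the two models fairly, each needs both a training and a test r-squared, computed on the same split. The sketch below shows one way to organise that on hypothetical synthetic data in which the log of the price is linear in the features. The helper `train_and_test_r2` and the data are assumptions for illustration, so the printed numbers will not match the Boston results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data where log(price) is a linear function of the features
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(506, 5))
weights = np.array([0.2, -0.1, 0.05, 0.0, 0.15])
log_price = 3.0 + X_demo @ weights + rng.normal(scale=0.1, size=506)
price = np.exp(log_price)

def train_and_test_r2(X, y):
    """Fit on the training split and return (train r-squared, test r-squared)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

raw_train, raw_test = train_and_test_r2(X_demo, price)
log_train, log_test = train_and_test_r2(X_demo, log_price)

print(f"raw prices: train r2 = {raw_train:.3f}, test r2 = {raw_test:.3f}")
print(f"log prices: train r2 = {log_train:.3f}, test r2 = {log_test:.3f}")
```

The test score is typically somewhat lower than the training score because the coefficients were tuned to the training rows. A large gap between the two would be a sign of overfitting.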

Predict a Property's Value using the Regression Coefficients¶

Our preferred model now has an equation that looks like this:

$$ \log (PR \hat ICE) = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta_3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$

The average property has the mean value for all its characteristics:

In [98]:
# Starting Point: Average Values in the Dataset
features = data.drop(['PRICE'], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)),
                              columns=features.columns)
property_stats
Out[98]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 3.61 11.36 11.14 0.07 0.55 6.28 68.57 3.80 9.55 408.24 18.46 356.67 12.65

Challenge

Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with .exp() to find the dollar value.

In [99]:
predicted_avg = regression.predict(property_stats)
In [102]:
np.exp(predicted_avg) * 1000 # price in $
Out[102]:
array([20703.17832102])
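
The model predicts a log price, so `np.exp()` undoes the transformation before the result is scaled from thousands to dollars. The round trip can be checked with the estimate above. In this sketch the log estimate is recovered from the printed dollar value rather than taken from the model.

```python
import numpy as np

# Dollar estimate from the output above, expressed in $1000s
price_in_thousands = 20.70317832102

log_estimate = np.log(price_in_thousands)      # what the log-price model predicts
dollar_estimate = np.exp(log_estimate) * 1000  # reverse the log, then scale to $

print(f"log price estimate: {log_estimate:.4f}")
print(f"dollar estimate:    ${dollar_estimate:,.2f}")
```

Forgetting the `np.exp()` step is an easy mistake to make: the raw prediction of about 3 would then be misread as a price of about $3,000.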

Challenge

Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:

In [103]:
# Define Property Characteristics
next_to_river = True
nr_rooms = 8
students_per_classroom = 20
distance_to_town = 5
pollution = data.NOX.quantile(q=0.75) # high
amount_of_poverty =  data.LSTAT.quantile(q=0.25) # low
In [104]:
# Solution:

particular_property_example = property_stats.copy()

particular_property_example['CHAS'] = 1
particular_property_example['RM'] = nr_rooms
particular_property_example['PTRATIO'] = students_per_classroom
particular_property_example['DIS'] = distance_to_town
particular_property_example['NOX'] = pollution
particular_property_example['LSTAT'] = amount_of_poverty
In [105]:
predicted_price_of_example = regression.predict(particular_property_example)
In [108]:
np.exp(predicted_price_of_example) * 1000 # price in $
Out[108]:
array([25792.0258724])